Abstract

In Bayesian statistics the precise point-null hypothesis $\theta = \theta_0$ can be
tested by checking whether $\theta_0$ is contained in a credible set. This permits
testing of $\theta = \theta_0$ without having to put prior probabilities on the
hypotheses. While such inversions of credible sets have a long history in Bayesian
inference, they have been criticised for lacking decision-theoretic justification.

We argue that these tests have many advantages over the standard
Bayesian tests that use point-mass probabilities on the null hypothesis. We
present a decision-theoretic justification for the inversion of central credible
intervals, and in a special case HPD sets, by studying a three-decision problem
with directional conclusions. Interpreting the loss function used in the
justification, we discuss when tests based on credible sets are applicable.

We then give some justifications for using credible sets when testing 
composite hypotheses, showing that tests based on credible sets coincide 
with standard tests in this setting. 

Keywords: Bayesian inference; credible set; confidence interval; decision 
theory; directional conclusion; hypothesis testing; three-decision problem. 



1 Introduction 

The first step of the standard solution to Bayesian testing of the point-null (or
sharp, or precise) hypothesis $\theta \in \Theta_0 = \{\theta_0\}$ is to assign a
prior probability to the hypothesis. This is necessary if one wishes to use e.g.
tests based on Bayes factors, as an absolutely continuous prior for $\theta$ would
yield $P(\theta \in \Theta_0 \mid x) = 0$ regardless of the data $x$. We shall refer
to this procedure, described in detail e.g. by Robert (2007, Section 5.2.4), as the
standard test or the standard solution in the following.

A continuous prior can however be utilized in a simpler alternative strategy, in
which the evidence against $\Theta_0$ is evaluated indirectly by using inverted
credible sets for testing. In this procedure a credible set $\Theta_c$, such that
$P(\theta \in \Theta_c \mid x) \geq 1 - \alpha$, is computed and the null hypothesis
is rejected if $\theta_0 \notin \Theta_c$. Using credible sets for testing avoids the
additional complication of imposing explicit probabilities on the hypotheses, avoids
Lindley's paradox and allows for the use of non-informative priors.
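
As a minimal sketch of the procedure just described, the following Python snippet
inverts a central credible interval for a hypothetical normal posterior. The
posterior N(1.2, 0.5^2), the level alpha = 0.05 and the helper names are assumptions
made purely for illustration; they do not come from the text.

```python
from statistics import NormalDist

def central_interval(posterior, alpha):
    """Return the 1 - alpha central (equal-tailed) credible interval."""
    return posterior.inv_cdf(alpha / 2), posterior.inv_cdf(1 - alpha / 2)

def reject_point_null(posterior, theta0, alpha=0.05):
    """Reject Theta_0 = {theta0} when theta0 falls outside the credible interval."""
    lower, upper = central_interval(posterior, alpha)
    return not (lower <= theta0 <= upper)

# Hypothetical posterior for theta (an assumption, not from the paper).
posterior = NormalDist(mu=1.2, sigma=0.5)
print(central_interval(posterior, 0.05))
print(reject_point_null(posterior, theta0=0.0))  # theta0 = 0 lies outside the interval
print(reject_point_null(posterior, theta0=1.0))  # theta0 = 1 lies inside the interval
```

No prior probability for the null hypothesis enters the computation; only posterior
quantiles are used.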

While it has been argued that tests of point-null hypotheses are unnatural from 
a Bayesian point of view, they are undeniably of considerable practical interest 
(Berger & Delampady, 1987; Ghosh et al., 2006, Section 2.7.2; Robert, 2007, 
Section 5.2.4). Inverting credible sets is arguably the simplest way to test such 
hypotheses. For this reason, inverting credible sets as a means to test hypotheses 
has long been, and continues to be, a part of Bayesian statistics. Tests of this type
have been studied by Box & Tiao (1965), Hsu (1982), Kim (1991), Drummond &
Rambaut (2007) and Kruschke (2011), among many others. Zellner (1971, Section
10.2) credited this procedure to Lindley (1965, Section 5.6), who in turn credited
it to Jeffreys (1961). Lindley suggested that it may be useful when the prior
information is vague or diffuse, with the caveat that the distribution of $\theta$
must be reasonably smooth around $\theta_0$ for the inversion to be sensible. Koch
(2007, Section 3.4) motivated this procedure by its resemblance to frequentist
methods. Ghosh et al. (2006, Section 2.7) described the inversion of credible sets
as "a very informal way of testing". Zellner remarked that "no decision theoretic
justification appears available for this procedure".

The purpose of this short note is to discuss such justifications and to put 
inverted credible sets on a decision-theoretic footing, making them formal Bayesian 
tests. These justifications shed light on when tests based on credible sets should 
be used. Throughout the text we assume that the parameter space
$\Theta \subseteq \mathbb{R}$ and that the posterior distribution of $\theta$ is
proper and absolutely continuous.

We will study three types of credible sets. A highest posterior density (HPD)
set contains all $\theta$ for which the posterior density is greater than some
threshold $k_\alpha$. Its appeal lies in the fact that among all $1 - \alpha$
credible sets, the HPD set has the shortest length.

With $q_\alpha(\theta \mid x)$ denoting the posterior quantile function of $\theta$,
such that $P(\theta \leq q_\alpha(\theta \mid x) \mid x) = \alpha$, the interval
$(q_{\alpha/2}(\theta \mid x), q_{1-\alpha/2}(\theta \mid x))$ is called a
$1 - \alpha$ central interval (also known as an equal-tailed, symmetric or
probability centred interval). Central intervals are often used because they are
computationally simpler than HPD sets, especially when the HPD set is not connected.
They are also more in line with frequentist practice.
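
To make the difference between the two interval types concrete, here is a small
sketch that computes both from posterior draws. The draws are a deterministic
quantile grid of an Exp(1) distribution, chosen only as a convenient skewed
stand-in for a posterior; the function names are hypothetical, and the
shortest-window search approximates the HPD set only when that set is connected.

```python
import math

def central_interval(draws, alpha):
    """Equal-tailed interval: cut an alpha/2 share of the sorted draws from each tail."""
    s = sorted(draws)
    n = len(s)
    lo = s[int(math.floor(alpha / 2 * n))]
    hi = s[int(math.ceil((1 - alpha / 2) * n)) - 1]
    return lo, hi

def shortest_interval(draws, alpha):
    """Shortest window holding a 1 - alpha share of the draws (connected HPD only)."""
    s = sorted(draws)
    n = len(s)
    k = int(math.ceil((1 - alpha) * n))  # number of draws the window must cover
    start = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[start], s[start + k - 1]

# Assumed skewed "posterior": a quantile grid of Exp(1), for illustration only.
n = 10_000
draws = [-math.log(1 - (i + 0.5) / n) for i in range(n)]
c_lo, c_hi = central_interval(draws, 0.05)
h_lo, h_hi = shortest_interval(draws, 0.05)
print((c_lo, c_hi), (h_lo, h_hi))
```

For this skewed example the shortest interval is both shorter and shifted towards
zero relative to the central interval.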

A credible set is a credible bound (or a one-sided credible interval) if it is of 
the form $\{\theta : \theta \leq q_{1-\alpha}(\theta \mid x)\}$ or
$\{\theta : \theta \geq q_\alpha(\theta \mid x)\}$. Such sets are of interest when a
lower or upper bound for the unknown parameter $\theta$ is desired.

First, we consider tests of the point-null hypothesis $\Theta_0 = \{\theta_0\}$. We
begin by comparing tests based on credible sets to the standard solution with
point-mass priors in Section 2.1, arguing that the former have many advantages.
We then review a decision-theoretic justification of inverted HPD sets proposed
by Madruga et al. (2001) in Section 2.2. This justification involves a loss function
that is non-standard in that it depends on the sample $x$. In Section 2.3 we propose
a justification of inverted central intervals by recasting the problem of testing
$\Theta_0$ in a three-decision setting. This allows us to use a weighted 0-1 loss to
justify tests based on central intervals, avoiding losses that depend on $x$. In
Section 2.4 we argue that tests using central intervals are preferable to tests using
HPD sets. We then turn our attention to tests of composite hypotheses in Section 3
and show that inverting credible sets leads to standard tests in this setting.



2 Testing point-null hypotheses 

2.1 Contrasting the standard solution and credible sets 

Consider tests of the point-null hypothesis $\Theta_0 = \{\theta_0\}$ against
$\Theta_1 = \{\theta : \theta \neq \theta_0\}$. We will discuss two problems
associated with the standard solution to point-null hypothesis testing (the use of
mixture priors and its poor frequentist properties) that are overcome by using
inverted credible sets instead. We then discuss the main drawback of tests based on
credible sets, namely the lack of direct measures of evidence. In what follows,
$\pi_0 > 0$ denotes the prior probability of $\Theta_0$ in the standard solution,
with $\pi_1 = 1 - \pi_0$ being the prior probability of $\Theta_1$.



1. The use of mixture priors in the standard test. One could argue, as Berger
& Sellke (1987, Section 2) do, that when using the standard solution it is in general
reasonable to have $\pi_0 \geq 1/2$ so that $\Theta_0$ is not rejected because of its
low prior probability. This is common in practice. It is however much more sensible
to adjust the loss function if stronger evidence is required before $\Theta_0$ is
rejected. The prior distribution should reflect the prior beliefs (or lack of them)
and not the severity of rejecting $\Theta_0$. Choosing a large $\pi_0$ corresponds to
using a prior with a sharp spike close to $\theta_0$. Such priors are only reasonable
if there is a very strong prior belief in the null hypothesis, and cannot be used in
an analysis where at least some degree of objectivity is desired.

When lacking prior information or striving for complete objectivity, it is common
practice to use a prior that in some sense is non-informative or objective. The
point-null hypothesis is often a simplification of the hypothesis that
$|\theta - \theta_0| < \epsilon$ for some small $\epsilon$, in which case
$\pi_0 = P(|\theta - \theta_0| < \epsilon)$. If $P(|\theta - \theta_0| < \epsilon)$
is computed under an objective prior on $\theta$, $\pi_0$ will invariably be
extremely small. The "objective" prior $\pi_0 = 1/2$ is in fact heavily biased
towards $\Theta_0$.
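
A rough numerical illustration of how small this prior mass can be is sketched
below. Objective priors are often improper, so a diffuse proper prior N(0, 100^2)
is used here as an assumed stand-in; that prior, the values theta0 = 0 and
epsilon = 0.1, and the function name are all hypothetical choices.

```python
from statistics import NormalDist

def prior_mass_near_null(prior, theta0, eps):
    """Prior probability of the interval |theta - theta0| < eps."""
    return prior.cdf(theta0 + eps) - prior.cdf(theta0 - eps)

# Assumed diffuse prior standing in for a vague "objective" prior.
pi0 = prior_mass_near_null(NormalDist(mu=0.0, sigma=100.0), theta0=0.0, eps=0.1)
print(pi0)  # far below the conventional choice pi0 = 1/2
```

Under these assumptions the implied prior mass is below one in a thousand.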



Berger & Sellke (1987, Section 2) claim that a point-null test only makes sense
to a Bayesian if the prior actually has a sharp spike near $\theta_0$. But it is perfectly
reasonable to formulate a hypothesis before constructing an informative prior or 
to require that an objective prior is used when no prior information is available. 
It is furthermore reasonable to require that the prior distribution can be used for 
more than one type of statistical decision, e.g. both estimation and hypothesis 
testing; the prior should reflect prior beliefs and not the type of decision that one 
is interested in. 

2. The poor frequentist performance of the standard test. There are many 
situations in which it is reasonable to require that a statistical procedure works well 
from both a Bayesian and a frequentist perspective, particularly when an objective 
analysis is desirable. A well-known complication with the standard solution to 
testing point-null hypotheses is the asymptotic discrepancy between Bayesian and 
frequentist analysis that is known as Lindley's paradox (Lindley, 1957), in which, 
for any fixed prior, $P(\theta \in \Theta_0 \mid x)$ goes to 1 as the sample size
increases for values of a test statistic that correspond to a fixed (small) p-value.
Thus a frequentist and a Bayesian will reach different conclusions as more and more
data are collected. This is another example of how the standard Bayesian test favours
the null hypothesis.
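
The paradox can be reproduced numerically. The sketch below assumes the textbook
conjugate setting, which is not specified in the text: normal data with known
standard deviation sigma, point mass pi0 on theta0, and a N(theta0, tau^2) prior
on theta under the alternative. The function and parameter names are hypothetical.

```python
import math

def posterior_null_probability(z, n, pi0=0.5, tau=1.0, sigma=1.0):
    """P(Theta_0 | x) for a normal mean with known sigma, point mass pi0 on theta0
    and a N(theta0, tau^2) prior on theta under the alternative (assumed setup)."""
    r = n * tau**2 / sigma**2
    bayes_factor_01 = math.sqrt(1 + r) * math.exp(-0.5 * z**2 * r / (1 + r))
    return 1 / (1 + (1 - pi0) / pi0 / bayes_factor_01)

z = 2.0  # fixed test statistic, two-sided p-value of about 0.046
for n in (10, 1_000, 1_000_000):
    print(n, round(posterior_null_probability(z, n), 3))
```

With the test statistic held fixed, the posterior probability of the null grows
towards 1 as n increases, even though the p-value stays the same.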

There is no such discrepancy for tests based on credible sets. On the contrary, 
when based on non-informative priors, credible sets tend to have favourable
properties when treated as frequentist confidence intervals (see e.g. Brown et al.,
2001), and hypothesis testing using a credible set is often a valid frequentist
procedure.

3. Measures of evidence. The main argument against the use of credible sets
for testing is that they do not utilize $P(\theta \in \Theta_0 \mid x)$, and so do not
really measure the amount of evidence for and against $\Theta_0$. Alas, the first part
of this argument seems weak, considering the contradictions and paradoxes that the
standard tests utilizing $P(\theta \in \Theta_0 \mid x)$ give rise to. It should also
be noted that a test can be fully Bayesian without being based on
$P(\theta \in \Theta_0 \mid x)$ as long as it depends only on the posterior
distribution. Tests based on inverted credible sets have this property.

While credible sets do not measure the evidence against $\Theta_0$ directly, there
are indirect measures associated with such tests. With $T(x)$ being the largest HPD
set not containing $\theta_0$, Pereira & Stern (1999) proposed
$P(\theta \notin T(x) \mid x)$, i.e. the smallest $\alpha$ for which $\theta_0$ is not
contained in the $1 - \alpha$ HPD set, as a measure of evidence against $\Theta_0$,
with values close to 0 indicating that $\Theta_0$ is false. Similar in spirit to the
frequentist p-value, this is a measure of how far out in the tails of the posterior
distribution $\theta_0$ is. With $S(x)$ being the largest central interval not
containing $\theta_0$, one can similarly use $P(\theta \notin S(x) \mid x)$ as a
measure of evidence. When $T(x)$ and $S(x)$ are viewed as frequentist confidence
intervals, these measures of evidence coincide with the p-values of the corresponding
two-sided tests. This stands in sharp contrast to the evidence measure
$P(\theta \in \Theta_0 \mid x)$ utilized in the standard solution, which is
irreconcilable with p-values (Berger & Sellke, 1987).
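
The agreement with p-values can be checked in a simple case. The sketch below
assumes normal data with known sigma and a flat prior on theta, so that the
posterior is N(xbar, sigma^2 / n); the data summary values and the function name
are hypothetical. Since each tail outside S(x) has posterior mass equal to the
smaller of the two tail probabilities at theta0, the evidence measure is twice
that smaller tail.

```python
from statistics import NormalDist

def central_evidence(posterior, theta0):
    """P(theta not in S(x) | x): twice the smaller posterior tail beyond theta0."""
    left = posterior.cdf(theta0)
    return 2 * min(left, 1 - left)

# Assumed setting: known sigma and a flat prior, so the posterior is N(xbar, se^2).
xbar, sigma, n, theta0 = 0.4, 1.0, 25, 0.0
se = sigma / n ** 0.5
evidence = central_evidence(NormalDist(xbar, se), theta0)
z = (xbar - theta0) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided z-test
print(evidence, p_value)
```

In this setting the two numbers agree up to floating-point error.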



2.2 The Madruga-Esteves-Wechsler loss and HPD sets

Pereira & Stern (1999) proposed, along with the p-value-like measure of evidence
mentioned above, a test of $\Theta_0 = \{\theta_0\}$ that is equivalent to inverting
HPD sets, dubbing it the Full Bayesian Significance Test. See also Madruga et al.
(2003) and Pereira et al. (2008). None of the aforementioned papers discuss the
historical background of this test, briefly outlined in Section 1 of the present
paper.

Let $\pi(\cdot)$ denote the a priori density function of $\theta$, let
$\pi(\cdot \mid x)$ denote the corresponding posterior density, and define
$T(x) = \{\theta : \pi(\theta \mid x) > \pi(\theta_0 \mid x)\}$. $T(x)$ is the most
credible HPD set not containing $\theta_0$. The Pereira-Stern test rejects $\Theta_0$
when $P(\theta \notin T(x) \mid x)$ is "small" ($< 0.05$, say). In other words,
$\Theta_0$ is rejected at the 5 % level if and only if $\theta_0$ is not contained in
the 95 % HPD set.

Madruga, Esteves and Wechsler (2001) provided a decision-theoretic justification
for the Pereira-Stern test that is non-standard in that the loss function
depends on $x$. Let the test function $\varphi$ be 0 if $\Theta_0$ is accepted and 1
if $\Theta_0$ is rejected. The Madruga-Esteves-Wechsler loss is

$$
L_{\mathrm{MEW}}(\theta, \varphi, x) =
\begin{cases}
a\,(1 - I(\theta \in T(x))), & \text{if } \varphi(x) = 1, \\
b + c\,I(\theta \in T(x)), & \text{if } \varphi(x) = 0,
\end{cases}
$$

with $a, b, c > 0$. Minimization of the expected posterior loss leads to the
Pereira-Stern test, where $\Theta_0$ is rejected if
$P(\theta \notin T(x) \mid x) < (b + c)/(a + c)$. As an example, the test resulting
from the choice $a = 0.975$ and $b = c = 0.025$ is equivalent to inverting the 95 %
HPD set.
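
The rejection threshold follows from comparing the posterior expected losses of
the two decisions. As a short worked sketch, write
$p = P(\theta \notin T(x) \mid x)$; the shorthand $p$ is introduced here only for
illustration and is not notation from the original.

```latex
\begin{align*}
E\bigl(L_{\mathrm{MEW}}(\theta, 1, x) \mid x\bigr) &= a\,p, \\
E\bigl(L_{\mathrm{MEW}}(\theta, 0, x) \mid x\bigr) &= b + c\,(1 - p), \\
a\,p < b + c\,(1 - p) &\iff p < \frac{b + c}{a + c}.
\end{align*}
```

With $a = 0.975$ and $b = c = 0.025$ the threshold equals $0.05/1 = 0.05$, which
matches the 95 % HPD set in the example.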

The controversial part of this decision-theoretic justification of inverting HPD
sets is that the loss function depends on $x$. While such loss functions have
appeared in the literature a few times (Madruga et al. (2001) give Kadane (1992)
and Bernardo & Smith (1994, Section 6.1.4) as examples), they do not appear to
be widely accepted. Indeed, choosing the loss function after $x$ has been observed
is arguably not entirely in line with classic decision theory. Pereira et al. (2008)
argue that an advantage with this type of loss is that it allows the statistician to
include his or her embarrassment over accepting $\Theta_0$ when $\theta_0$ is not in a
particular HPD set. Such loss functions seem applicable in highly subjective analyses,
but should be avoided when objectivity is desirable. As we will show next, tests based
on central intervals can, in contrast, be justified using a standard loss function.

2.3 Central intervals



When testing $\Theta_0$, credible (and confidence) intervals are in practice often
used for directional conclusions. If the credible interval lies entirely above (or
below) $\theta_0$, it is common practice to reject $\Theta_0$ and to conclude that
$\theta$ is larger (or smaller) than $\theta_0$.

Next, we provide a decision-theoretic justification of the inversion of central 
intervals that uses a standard loss function not involving x, and for which the 
directional conclusions are fully valid. A justification for the use of HPD sets is 
obtained in a special corollary. 

We formally reformulate the problem of testing the point-null hypothesis
$\Theta_0 = \{\theta_0\}$ as a three-decision problem with directional conclusions.
$\Theta_0$ is then tested against both $\Theta_{-1} = \{\theta : \theta < \theta_0\}$
and $\Theta_1 = \{\theta : \theta > \theta_0\}$. Such problems have previously been
studied by for instance Jones & Tukey (2000) and Jonsson (2012) in the frequentist
setting and Bansal & Sheng (2010) and Bansal et al. (2012) in the Bayesian setting.

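Theorem 1. Let $\Theta_0 = \{\theta_0\}$,
$\Theta_{-1} = \{\theta : \theta < \theta_0\}$ and
$\Theta_1 = \{\theta : \theta > \theta_0\}$. Let $\varphi$ be a decision function
that takes values in $\{-1, 0, 1\}$, such that we accept $\Theta_i$ if
$\varphi(x) = i$. Under an absolutely continuous prior for $\theta$ and the loss
function

$$
L_{(1)}(\theta, \varphi) =
\begin{cases}
0, & \text{if } \theta \in \Theta_i \text{ and } \varphi = i, \; i \in \{-1, 0, 1\}, \\
\alpha/2, & \text{if } \theta \notin \Theta_0 \text{ and } \varphi = 0, \\
1, & \text{if } \theta \in \Theta_i \cup \Theta_0 \text{ and } \varphi = -i, \; i \in \{-1, 1\},
\end{cases}
$$

with $0 < \alpha < 1$, the Bayes test is to reject $\Theta_0$ if $\theta_0$ is not
contained in the central $1 - \alpha$ credible set.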

Proof. The expected posterior loss is

$$
\begin{aligned}
E(L_{(1)}(\theta, \varphi) \mid x)
&= P(\theta \in \Theta_1 \mid x)\, I_{\{-1\}}(\varphi)
 + P(\theta \in \Theta_{-1} \mid x)\, I_{\{1\}}(\varphi)
 + \frac{\alpha}{2}\, P(\theta \notin \Theta_0 \mid x)\, I_{\{0\}}(\varphi) \\
&= P(\theta \in \Theta_1 \mid x)\, I_{\{-1\}}(\varphi)
 + P(\theta \in \Theta_{-1} \mid x)\, I_{\{1\}}(\varphi)
 + \frac{\alpha}{2}\, I_{\{0\}}(\varphi),
\end{aligned}
$$

since $P(\theta \notin \Theta_0 \mid x) = 1$ because of the absolute continuity of the
posterior of $\theta$. The posterior expected losses are therefore
$E(L_{(1)}(\theta, 0) \mid x) = \alpha/2$ and
$E(L_{(1)}(\theta, i) \mid x) = P(\theta \in \Theta_{-i} \mid x)$ for
$i \in \{-1, 1\}$. It follows that the Bayes decision rule is to accept $\Theta_0$ if
$\alpha/2 < \min_{j \in \{-1, 1\}} P(\theta \in \Theta_j \mid x)$, or, equivalently, if
$1 - \alpha/2 > \max_{j \in \{-1, 1\}} P(\theta \in \Theta_j \mid x)$. But
$P(\theta \in \Theta_{-1} \mid x) > 1 - \alpha/2$ if and only if the upper bound
$q_{1-\alpha/2}(\theta \mid x) < \theta_0$, in which case $\theta_0$ is not contained
in the central interval. Similarly, $P(\theta \in \Theta_1 \mid x) > 1 - \alpha/2$ if
and only if the lower bound $q_{\alpha/2}(\theta \mid x) > \theta_0$. Thus $\Theta_0$
is accepted if and only if $\theta_0$ is contained in the $1 - \alpha$ central
interval. □

It is not uncommon that the posterior density is unimodal and symmetric,
typical cases of this being normal and Student's t posteriors. In this case, the
central interval coincides with the HPD set. The following corollary is therefore
immediate.

Corollary 1. If $\pi(\theta \mid x)$ is absolutely continuous, symmetric and unimodal,
the Bayes test under $L_{(1)}$ is to reject $\Theta_0$ if $\theta_0$ is not contained
in the $1 - \alpha$ HPD set.
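
The equivalence stated in Theorem 1 can be checked numerically. The sketch below
minimises the posterior expected loss over the three decisions and compares the
result with the decision obtained by inverting the central interval. The
exponential posterior with rate 2 is a hypothetical choice made because its
distribution and quantile functions have closed forms; the function names are
likewise illustrative.

```python
import math

def bayes_three_decision(p_below, p_above, alpha):
    """Decision in {-1, 0, 1} minimising the posterior expected three-decision loss.
    p_below = P(theta < theta0 | x), p_above = P(theta > theta0 | x)."""
    expected_loss = {0: alpha / 2, -1: p_above, 1: p_below}
    return min(expected_loss, key=expected_loss.get)

def central_interval_decision(quantile, theta0, alpha):
    """Directional conclusion from inverting the 1 - alpha central interval."""
    if theta0 < quantile(alpha / 2):
        return 1   # interval lies above theta0: conclude theta > theta0
    if theta0 > quantile(1 - alpha / 2):
        return -1  # interval lies below theta0: conclude theta < theta0
    return 0

# Hypothetical skewed posterior: exponential with rate 2 (an assumption).
rate, alpha = 2.0, 0.05
cdf = lambda t: 1 - math.exp(-rate * t)
quantile = lambda p: -math.log(1 - p) / rate

for theta0 in (0.005, 0.5, 1.0, 2.5):
    p_below = cdf(theta0)
    d_bayes = bayes_three_decision(p_below, 1 - p_below, alpha)
    d_interval = central_interval_decision(quantile, theta0, alpha)
    print(theta0, d_bayes, d_interval)
```

The two columns of decisions coincide for every value of theta0 tried here.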

Typically $\alpha \leq 0.1$ in $L_{(1)}$, in which case the evidence for either
$\Theta_{-1}$ or $\Theta_1$ has to be quite large before $\Theta_0$ is rejected. This
is a conservative type of analysis, where the statistician is somewhat reluctant to
reject $\Theta_0$. Examples of situations where such a loss is applicable include many
investigations of hypotheses in science, circumstances where falsely rejecting
$\Theta_0$ leads to considerable economic losses, and legal and forensic questions
(this loss is in line with the classic in dubio pro reo principle).

The posterior probability of the null hypothesis is not evaluated in this test (it 
is always 0), but the probabilities of the two alternative hypotheses are. By the 
process of elimination, $\Theta_0$ is accepted if neither $\Theta_{-1}$ nor $\Theta_1$
has a high enough posterior probability. $\Theta_0$ is rejected if $\theta_0$ is far
out in the tails of the posterior distribution, making these tests quite similar to
frequentist hypothesis testing.

2.4 Reasons to use central intervals rather than HPD sets for testing

For historical reasons, tests based on HPD sets are much more common in the
literature than tests using central intervals; see the references given in Section
1 for examples. However, from a decision-theoretic perspective, the somewhat
tautological use of the HPD set $T(x)$ in the loss function $L_{\mathrm{MEW}}$ seems
artificial in comparison to the easy-to-interpret loss function $L_{(1)}$. The
$L_{\mathrm{MEW}}$-justification of tests using HPD sets involves a good deal more
subjectivity than does the justification of tests using central intervals. In most
situations, the latter test seems to be preferable when a test with a
decision-theoretic justification is desirable.

A further argument for preferring inverted central intervals is given by studying
how the two types of tests reject the null hypothesis. For $\theta \in \mathbb{R}$,
when $\Theta_0 = \{\theta_0\}$ is rejected, there is always an implicit directional
aspect to this decision: the null hypothesis is rejected because it seems more likely
that $\theta$ belongs to either $\Theta_{-1} = \{\theta : \theta < \theta_0\}$ or
$\Theta_1 = \{\theta : \theta > \theta_0\}$. When the posterior is skewed, HPD sets
are, by construction, biased towards one of these sub-alternatives. Such directional
rejection biases can be avoided by using inverted central intervals for testing.

When the frequentist properties of tests based on HPD sets and central intervals
are evaluated in some common point-null testing problems for scale or rate
parameters, it is seen that this rejection bias causes the HPD set test to have worse
power properties than the central interval test. For inference about the scale/rate
parameters in the normal, gamma, inverse gamma and Weibull distributions using
the corresponding Jeffreys priors, the two tests have comparable power for
$\theta < \theta_0$, while the test using the central interval has higher power for
$\theta > \theta_0$, sometimes substantially so. Examples are given in Figures 1 and
2; the power functions for the other cases are similar. In general, the central
interval test also seems to come closer to attaining the nominal size in this setting.

Figure 1: Power of two-sided tests of the hypothesis $H_0 : \sigma^2 = 1$, where
$\sigma^2$ is the variance of the normal distribution with known mean, based on 95 %
credible sets using the Jeffreys prior.

Figure 2: Power of two-sided tests of the hypothesis $H_0 : \lambda = 1$, where
$\lambda$ is the rate parameter of the exponential distribution, based on 95 %
credible sets using the Jeffreys prior.

A similar issue occurs when the posterior density is monotone, as is for instance
the case when using the conjugate Pareto prior for $\theta$ in $U(0, \theta)$. In
this setting the HPD set is an upper or lower credible bound, meaning that it is only
possible to reject $\Theta_0$ in one direction, so that the resulting test in fact is
one-sided.
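
The one-sidedness of the HPD test under a monotone posterior can be seen in a toy
case. The sketch below assumes, purely for illustration, an exponential posterior
with rate 1, whose density is strictly decreasing so that the HPD set is the bound
(0, q_{1 - alpha}); the helper name is hypothetical.

```python
import math

# Assumption: exponential posterior with rate 1 (strictly decreasing density),
# so the 1 - alpha HPD set is the one-sided bound (0, q_{1 - alpha}).
alpha = 0.05
quantile = lambda p: -math.log(1 - p)

hpd = (0.0, quantile(1 - alpha))
central = (quantile(alpha / 2), quantile(1 - alpha / 2))

def rejects(interval, theta0):
    """Reject Theta_0 = {theta0} when theta0 lies outside the interval."""
    return not (interval[0] <= theta0 <= interval[1])

for theta0 in (0.01, 1.0, 3.2, 4.0):
    print(theta0, rejects(hpd, theta0), rejects(central, theta0))
```

Here the HPD test can never reject for small values of theta0, whereas the central
interval test rejects in both directions.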
3 Testing composite hypotheses



When performing tests of composite hypotheses, there is typically no discrepancy
between tests based on credible sets and the standard Bayesian tests. This will be
illustrated with two examples below. For $\Theta_0$ and $\Theta_1$ being uncountable
subsets of $\mathbb{R}$, both with positive probabilities under an absolutely
continuous prior, we will consider the weighted 0-1 loss function

$$
L_{(2)}(\theta, \varphi) =
\begin{cases}
0, & \text{if } \varphi = 1 - I_{\Theta_0}(\theta), \\
a, & \text{if } \theta \in \Theta_0 \text{ and } \varphi = 1, \\
b, & \text{if } \theta \in \Theta_1 \text{ and } \varphi = 0.
\end{cases}
$$
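
The Bayes rule under this weighted 0-1 loss reduces to comparing two expected
losses. A minimal sketch follows, assuming that $\Theta_1$ is the complement of
$\Theta_0$ so that the two posterior probabilities sum to one; the function names
are hypothetical.

```python
def bayes_test(p_null, a, b):
    """Return 1 (reject Theta_0) or 0 (accept) under the weighted 0-1 loss.
    Rejecting costs a when theta is in Theta_0; accepting costs b otherwise."""
    loss_reject = a * p_null
    loss_accept = b * (1 - p_null)
    return 1 if loss_reject < loss_accept else 0

def rejection_threshold(a, b):
    """Theta_0 is rejected exactly when P(Theta_0 | x) < b / (a + b)."""
    return b / (a + b)

alpha = 0.05
print(rejection_threshold(a=1 - alpha, b=alpha))  # weights used in Theorem 2
print(rejection_threshold(a=alpha, b=1 - alpha))  # weights used in Theorem 3
```

Swapping the two weights moves the rejection threshold from alpha to 1 - alpha,
which is the only difference between the tests in Theorems 2 and 3.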

First, we consider one-sided composite hypotheses. The proof of the following
theorem is analogous to that of Theorem 1 and is therefore omitted.

Theorem 2. For the hypotheses $\Theta_0 = \{\theta : \theta \leq \theta_0\}$ and
$\Theta_1 = \{\theta : \theta > \theta_0\}$, with $P(\Theta_0), P(\Theta_1) > 0$, let
$\varphi$ be a test function, such that $\Theta_0$ is rejected if $\varphi = 1$ and
accepted if $\varphi = 0$. Under an absolutely continuous prior for $\theta$ and the
loss function $L_{(2)}$ with $a = 1 - \alpha$ and $b = \alpha$, $0 < \alpha < 1$, the
Bayes test is to reject $\Theta_0$ if and only if $\theta_0$ is not contained in the
lower-bound credible set $\{\theta : \theta \geq q_\alpha(\theta \mid x)\}$, or,
equivalently, if and only if $P(\theta \in \Theta_0 \mid x) < \alpha$.

This is a standard one-sided test and the credible set test is thus in agreement
with the standard methods. The corresponding justification for upper-bound credible
sets is given by considering testing the hypothesis
$\Theta_0 = \{\theta : \theta \geq \theta_0\}$ against
$\Theta_1 = \{\theta : \theta < \theta_0\}$ instead.
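
The two equivalent formulations of the one-sided test in Theorem 2 can be compared
directly. The sketch below uses a hypothetical normal posterior N(1, 0.4^2) and
illustrative function names; neither comes from the text.

```python
from statistics import NormalDist

def reject_by_probability(posterior, theta0, alpha):
    """Reject Theta_0 = {theta <= theta0} when its posterior probability < alpha."""
    return posterior.cdf(theta0) < alpha

def reject_by_credible_bound(posterior, theta0, alpha):
    """Reject when theta0 falls below the lower credible bound q_alpha."""
    return theta0 < posterior.inv_cdf(alpha)

posterior = NormalDist(mu=1.0, sigma=0.4)  # hypothetical posterior
for theta0 in (0.0, 0.3, 0.5, 1.0):
    print(theta0,
          reject_by_probability(posterior, theta0, 0.05),
          reject_by_credible_bound(posterior, theta0, 0.05))
```

Both rules give the same decision for each theta0, since the posterior
distribution function is strictly increasing.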

For a general composite hypothesis $\theta \in \Theta_0$ with the alternative
$\theta \in \Theta_1 = \Theta \setminus \Theta_0$, a correspondence between hypothesis
testing and credible sets can be established
using a loss function where falsely accepting the null hypothesis is considered to 
be much worse than falsely rejecting it. Such a loss can be of considerable use in 
situations where false acceptance of the null hypothesis can be costly, for instance 
when screening for diseases, malicious web pages or signs of terrorist activity. 

Theorem 3. For the hypotheses $\Theta_0$ and $\Theta_1 = \Theta \setminus \Theta_0$,
with $P(\Theta_0), P(\Theta_1) > 0$, let $\varphi$ be a test function, such that
$\Theta_0$ is rejected if $\varphi = 1$ and accepted if $\varphi = 0$. Under an
absolutely continuous prior for $\theta$ and the loss function $L_{(2)}$ with
$a = \alpha$ and $b = 1 - \alpha$, $0 < \alpha < 1/2$, the Bayes test is to reject
$\Theta_0$ if and only if there exists at least one $\alpha$ credible set that does
not contain a non-null subset of $\Theta_0$, or, equivalently, if and only if
$P(\theta \in \Theta_0 \mid x) < 1 - \alpha$.



Proof. By Proposition 5.2.2 of Robert (2007), the Bayes test is to accept $\Theta_0$
if $P(\theta \in \Theta_0 \mid x) \geq 1 - \alpha$. Now, a credible set $\Theta_c$ is
a set such that $P(\Theta_c \mid x) \geq 1 - \alpha$. Thus, by definition, if
$\Theta_0$ is such that $P(\Theta_0 \mid x) \geq 1 - \alpha$, $\Theta_c$ can be a
credible set only if $P(\Theta_0 \cap \Theta_c \mid x) > 0$. Hence, we accept the null
hypothesis if and only if each $1 - \alpha$ credible set contains a non-null subset of
$\Theta_0$. □



Once again, this is a standard test, albeit a less commonly used one. Using 
this loss function corresponds to the statistician being either unsure about which 
credible set to report or unwilling to accept $\Theta_0$ if there exists at least one
way to look at the data that calls $\Theta_0$ into question.